Case Study 2
- Data import
- Check for missing data
- Pre-Processing
- Exploratory Analysis
- Helper Functions
- Attrition
- Monthly Income
- Age
- Business Travel
- Department
- Distance From Home
- Education
- Environment Satisfaction
- Gender
- Job Involvement
- Job Level
- Job Role
- Job Satisfaction
- Marital Status
- Num Companies Worked
- Over Time
- Performance Rating
- Relationship Satisfaction
- Stock Option Level
- Salary Hike
- Total Working Years
- Training Times Last Year
- Work Life Balance
- Years At Company
- Years In CurrentRole
- Years Since Last Promotion
- Years With Current Manager
- Daily, Hourly, and Monthly Rates
- Other Variables
- Additional Analysis
- Train/Test Split
- Variable Importance
- Linear Model
- Naive Bayes
- Random Forest (Attrition)
Data import
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 870 obs. of 36 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Age : num 32 40 35 32 24 27 41 37 34 34 ...
$ Attrition : chr "No" "No" "No" "No" ...
$ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
$ DailyRate : num 117 1308 200 801 567 ...
$ Department : chr "Sales" "Research & Development" "Research & Development" "Sales" ...
$ DistanceFromHome : num 13 14 18 1 2 10 5 10 10 10 ...
$ Education : num 4 3 2 4 1 2 5 4 4 4 ...
$ EducationField : chr "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
$ EmployeeCount : num 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : num 859 1128 1412 2016 1646 ...
$ EnvironmentSatisfaction : num 2 3 3 3 1 4 2 4 3 4 ...
$ Gender : chr "Male" "Male" "Male" "Female" ...
$ HourlyRate : num 73 44 60 48 32 32 90 88 87 92 ...
$ JobInvolvement : num 3 2 3 3 3 3 4 2 3 2 ...
$ JobLevel : num 2 5 3 3 1 3 1 2 1 2 ...
$ JobRole : chr "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
$ JobSatisfaction : num 4 3 4 4 4 1 3 4 3 3 ...
$ MaritalStatus : chr "Divorced" "Single" "Single" "Married" ...
$ MonthlyIncome : num 4403 19626 9362 10422 3760 ...
$ MonthlyRate : num 9250 17544 19944 24032 17218 ...
$ NumCompaniesWorked : num 2 1 2 1 1 1 2 2 1 1 ...
$ Over18 : chr "Y" "Y" "Y" "Y" ...
$ OverTime : chr "No" "No" "No" "No" ...
$ PercentSalaryHike : num 11 14 11 19 13 21 12 14 19 14 ...
$ PerformanceRating : num 3 3 3 3 3 4 3 3 3 3 ...
$ RelationshipSatisfaction: num 3 1 3 3 3 3 1 3 4 2 ...
$ StandardHours : num 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : num 1 0 0 2 0 2 0 3 1 1 ...
$ TotalWorkingYears : num 8 21 10 14 6 9 7 8 1 8 ...
$ TrainingTimesLastYear : num 3 2 2 3 2 4 5 5 2 3 ...
$ WorkLifeBalance : num 2 4 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : num 5 20 2 14 6 9 4 1 1 8 ...
$ YearsInCurrentRole : num 2 7 2 10 3 7 2 0 1 2 ...
$ YearsSinceLastPromotion : num 0 4 2 5 1 1 0 0 0 7 ...
$ YearsWithCurrManager : num 3 9 2 7 3 7 3 0 0 7 ...
- attr(*, "spec")=
.. cols(
.. ID = col_double(),
.. Age = col_double(),
.. Attrition = col_character(),
.. BusinessTravel = col_character(),
.. DailyRate = col_double(),
.. Department = col_character(),
.. DistanceFromHome = col_double(),
.. Education = col_double(),
.. EducationField = col_character(),
.. EmployeeCount = col_double(),
.. EmployeeNumber = col_double(),
.. EnvironmentSatisfaction = col_double(),
.. Gender = col_character(),
.. HourlyRate = col_double(),
.. JobInvolvement = col_double(),
.. JobLevel = col_double(),
.. JobRole = col_character(),
.. JobSatisfaction = col_double(),
.. MaritalStatus = col_character(),
.. MonthlyIncome = col_double(),
.. MonthlyRate = col_double(),
.. NumCompaniesWorked = col_double(),
.. Over18 = col_character(),
.. OverTime = col_character(),
.. PercentSalaryHike = col_double(),
.. PerformanceRating = col_double(),
.. RelationshipSatisfaction = col_double(),
.. StandardHours = col_double(),
.. StockOptionLevel = col_double(),
.. TotalWorkingYears = col_double(),
.. TrainingTimesLastYear = col_double(),
.. WorkLifeBalance = col_double(),
.. YearsAtCompany = col_double(),
.. YearsInCurrentRole = col_double(),
.. YearsSinceLastPromotion = col_double(),
.. YearsWithCurrManager = col_double()
.. )
The data set has 870 observations and 36 variables for us to work with.
Check for missing data
# library(Amelia)
library(naniar)
# missmap(casedata,y.at=c(1),y.labels = c(''))
gg_miss_var(casedata, show_pct = TRUE) + labs(title = "Percent Missing by Data Field") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5),
axis.title.y = element_text(angle = 0, vjust = 1))
There does not appear to be any missing data in the dataset.
Pre-Processing
Here I am doing some processing of the data. I am going to convert several character columns to factors to make modeling easier, and I will also add derived variables to this step as I proceed with the analysis.
colsToFactor <- c("Attrition", "BusinessTravel", "Department", "EducationField",
"Gender", "JobRole", "MaritalStatus", "Over18", "OverTime", "StockOptionLevel",
"JobLevel", "JobInvolvement", "Education", "EnvironmentSatisfaction", "JobSatisfaction",
"RelationshipSatisfaction", "WorkLifeBalance", "PerformanceRating")
# Consider 'StockOptionLevel','JobLevel','JobInvolvement','Education'; They
# could be coded as numeric
casedata[, colsToFactor] <- lapply(casedata[, colsToFactor], as.factor)
casedata$logMonthlyIncome <- log(casedata$MonthlyIncome)
casedata$IncomeLt4000 <- ifelse(casedata$MonthlyIncome <= 4000, 1, 0)
casedata$DistHomeFactor <- cut(casedata$DistanceFromHome, c(0, 10, 20, 30),
labels = c("Close", "Medium", "Far"), include.lowest = TRUE)
casedata$AgeGroup <- cut(casedata$Age, c(18, 25, 35, 45, 60), labels = c("18-25",
"25-35", "35-45", "45-60"), include.lowest = TRUE)
casedata$NumCompCat <- cut(casedata$NumCompaniesWorked, c(0, 2, 6, 9), labels = c("0-2",
"2-6", "6-9"), include.lowest = TRUE)
casedata$WorkingYearsGroup <- cut(casedata$TotalWorkingYears, c(0, 5, 10, 15,
20, 40), labels = c("0-5", "5-10", "10-15", "15-20", "20-40"), include.lowest = TRUE)
casedata$RoleYearsGroup <- cut(casedata$YearsInCurrentRole, c(0, 3, 6, 10, 20),
labels = c("0-3", "3-6", "6-10", "10+"), include.lowest = TRUE)
casedata$CompanyYearsGroup <- cut(casedata$YearsAtCompany, c(0, 3, 10, 20, 40),
labels = c("0-3", "3-10", "10-20", "20-40"), include.lowest = TRUE)
casedata$IncomeGroup <- cut(casedata$MonthlyIncome, c(0, 4000, 8000, 12000,
16000, 20000), labels = c("<$4K", "$4K - $8K", "$8K-$12K", "$12K-$16K",
"$16K-$20K"), include.lowest = TRUE)
vars <- c("MonthlyIncome", "Age", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole",
"YearsWithCurrManager", "YearsSinceLastPromotion")
# This is an issue because of zeroes
casedata <- casedata %>% mutate_at(vars, list(log = log))
Exploratory Analysis
In this section I am going to look mostly at individual variables to try to get a sense of what each variable means and whether or not I should consider it for further analysis and modeling.
Helper Functions
This section is just to create some functions to help with exploratory analysis.
barPlot <- function(df, x) {
ggplot(df, aes_string(x = x)) + geom_bar()
}
propPlot <- function(df, x, y) {
ggplot(df, aes_string(x = x, fill = y)) + geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) + geom_abline(slope = 0,
intercept = 0.16)
}
fancyTable <- function(df, x) {
df %>% group_by_at(x) %>% summarise(Count = n(), Proportion = scales::percent(n()/dim(df)[1])) %>%
kable() %>% kable_styling(full_width = FALSE)
}
vioPlot <- function(df, x, y) {
ggplot(df, aes_string(x = x, y = y, fill = x)) + geom_violin(show.legend = FALSE) +
geom_boxplot(width = 0.2, show.legend = FALSE) + stat_summary(fun.y = mean,
geom = "point", shape = 5, size = 4, color = "black", show.legend = FALSE)
}
histPlot <- function(df, x) {
ggplot(df, aes_string(x)) + geom_histogram(bins = 30)
}
scatterPlot <- function(df = casedata, x, y = "MonthlyIncome") {
ggplot(df, aes_string(x = x, y = y)) + geom_point() + geom_smooth(method = "lm")
}
Attrition
| Attrition | Count | Proportion |
|---|---|---|
| No | 730 | 83.9% |
| Yes | 140 | 16.1% |
Out of the 870 employees, 140 left their jobs, which is a 16% attrition rate.
Monthly Income
ggplot(casedata, aes(MonthlyIncome)) + geom_histogram(aes(y = ..density..),
bins = 30) + stat_function(fun = dnorm, color = "red", args = list(mean = mean(casedata$MonthlyIncome),
sd = sd(casedata$MonthlyIncome)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1081 2840 4946 6390 8182 19999
We have Monthly Incomes ranging from $1,081 to $19,999. Income is strongly right-skewed, which will affect modeling; we probably want to transform it.
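One way to quantify the skew before and after a transform is the sample skewness (the third standardized moment). A minimal sketch in base R on simulated lognormal data; the `skewness` helper and the simulated incomes are my own illustration, not part of the case study:

```r
# Sample skewness: average cubed z-score
skewness <- function(x) {
  m <- mean(x); s <- sd(x)
  mean(((x - m) / s)^3)
}

set.seed(42)
# Simulated right-skewed incomes, roughly on the MonthlyIncome scale
income <- rlnorm(870, meanlog = 8.5, sdlog = 0.6)

skewness(income)       # clearly positive: right-skewed
skewness(log(income))  # near zero after the log transform
```

A skewness near zero after logging is the behavior we hope for from the `logMonthlyIncome` variable above.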
ggplot(casedata, aes(logMonthlyIncome)) + geom_histogram(aes(y = ..density..),
bins = 30) + stat_function(fun = dnorm, color = "red", args = list(mean = mean(casedata$logMonthlyIncome),
sd = sd(casedata$logMonthlyIncome)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.986 7.951 8.506 8.538 9.010 9.903
The log of Monthly Income looks more normal, though the transformation did not help as much as I had hoped.
Age
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 30.00 35.00 36.83 43.00 60.00
Business Travel
| BusinessTravel | Count | Proportion |
|---|---|---|
| Non-Travel | 94 | 10.8% |
| Travel_Frequently | 158 | 18.2% |
| Travel_Rarely | 618 | 71.0% |
618 employees, or 71%, travel rarely. The most frequent travelers appear to have the highest attrition rates, and non-travelers the lowest. Non-travelers have a lower 75th percentile on income, while frequent and rare travelers have similar pay.
Department
| Department | Count | Proportion |
|---|---|---|
| Human Resources | 35 | 4.02% |
| Research & Development | 562 | 64.6% |
| Sales | 273 | 31.4% |
The Sales department has the highest attrition rate and R&D has the lowest. Mean income is fairly similar across departments, but there is a clear difference in medians, with HR having the lowest median pay and Sales the highest.
Distance From Home
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 7.000 9.339 14.000 29.000
Next I made a discretized version:
| DistHomeFactor | Count | Proportion |
|---|---|---|
| Close | 600 | 69.0% |
| Medium | 143 | 16.4% |
| Far | 127 | 14.6% |
Education
| Education | Count | Proportion |
|---|---|---|
| 1 | 98 | 11.3% |
| 2 | 182 | 20.9% |
| 3 | 324 | 37.2% |
| 4 | 240 | 27.6% |
| 5 | 26 | 2.99% |
Assuming higher values of Education mean more advanced education, better-educated employees seem to have lower attrition rates. Levels 1-3 are similar, but 4 and 5 show a clear decrease. Education level 5 also has a very clear pay advantage.
| EducationField | Count | Proportion |
|---|---|---|
| Human Resources | 15 | 1.72% |
| Life Sciences | 358 | 41.1% |
| Marketing | 100 | 11.5% |
| Medical | 270 | 31.0% |
| Other | 52 | 5.98% |
| Technical Degree | 75 | 8.62% |
Those with education in HR have the highest attrition rates in this sample. Education in Human Resources also seems to come with lower median pay, while those educated in Marketing have higher median pay.
Environment Satisfaction
| EnvironmentSatisfaction | Count | Proportion |
|---|---|---|
| 1 | 172 | 19.8% |
| 2 | 178 | 20.5% |
| 3 | 258 | 29.7% |
| 4 | 262 | 30.1% |
Assuming lower values mean less satisfied, those least satisfied with their environment have the highest attrition rates. Mean and median pay are similar between groups here, though the 75th percentile is lower for groups 2 and 4; I am not sure how meaningful this is to pay overall.
Gender
| Gender | Count | Proportion |
|---|---|---|
| Female | 354 | 40.7% |
| Male | 516 | 59.3% |
Attrition rates are similar between genders, with men having a slightly higher rate. In terms of pay, men make a little less than average in this sample.
Job Involvement
| JobInvolvement | Count | Proportion |
|---|---|---|
| 1 | 47 | 5.40% |
| 2 | 228 | 26.2% |
| 3 | 514 | 59.1% |
| 4 | 81 | 9.31% |
Those with lower job involvement have much higher attrition. The relationship with pay is much less clear.
Job Level
| JobLevel | Count | Proportion |
|---|---|---|
| 1 | 329 | 37.8% |
| 2 | 312 | 35.9% |
| 3 | 132 | 15.2% |
| 4 | 60 | 6.90% |
| 5 | 37 | 4.25% |
Job level 1 has a much higher attrition rate. Job level also seems very linearly associated with pay.
Job Role
| JobRole | Count | Proportion |
|---|---|---|
| Healthcare Representative | 76 | 8.74% |
| Human Resources | 27 | 3.10% |
| Laboratory Technician | 153 | 17.6% |
| Manager | 51 | 5.86% |
| Manufacturing Director | 87 | 10.0% |
| Research Director | 51 | 5.86% |
| Research Scientist | 172 | 19.8% |
| Sales Executive | 200 | 23.0% |
| Sales Representative | 53 | 6.09% |
Sales Representatives have the highest attrition rate by far. Managers and Research Directors have the highest pay and low attrition rates.
Job Satisfaction
| JobSatisfaction | Count | Proportion |
|---|---|---|
| 1 | 179 | 20.6% |
| 2 | 166 | 19.1% |
| 3 | 254 | 29.2% |
| 4 | 271 | 31.1% |
Marital Status
| MaritalStatus | Count | Proportion |
|---|---|---|
| Divorced | 191 | 22.0% |
| Married | 410 | 47.1% |
| Single | 269 | 30.9% |
Single employees have the highest attrition and below average pay.
Num Companies Worked
I would like to think this is the number of companies worked at before this one, because the minimum is zero, so employees in that category would have worked only at this company. Otherwise I do not see how zero makes sense, unless we are counting employees who have been hired but have not yet started.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.728 4.000 9.000
It seems that those who attrite have worked at fewer companies, though the pattern is not strong.
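A quick way to check an impression like this is a grouped mean. A minimal sketch with `tapply` on toy data; the vectors here are made up for illustration, not the case data:

```r
# Toy data: number of companies worked, by attrition status
attrition    <- c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No")
numCompanies <- c(1, 1, 2, 2, 4, 5, 3, 4)

# Mean number of companies per attrition group
groupMeans <- tapply(numCompanies, attrition, mean)
groupMeans
```

On the real data the same call would be `tapply(casedata$NumCompaniesWorked, casedata$Attrition, mean)`.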
Over Time
| OverTime | Count | Proportion |
|---|---|---|
| No | 618 | 71.0% |
| Yes | 252 | 29.0% |
Overtime eligible employees have much higher attrition rates and slightly lower monthly income.
Performance Rating
| PerformanceRating | Count | Proportion |
|---|---|---|
| 3 | 738 | 84.8% |
| 4 | 132 | 15.2% |
Relationship Satisfaction
| RelationshipSatisfaction | Count | Proportion |
|---|---|---|
| 1 | 174 | 20.0% |
| 2 | 171 | 19.7% |
| 3 | 261 | 30.0% |
| 4 | 264 | 30.3% |
Stock Option Level
| StockOptionLevel | Count | Proportion |
|---|---|---|
| 0 | 379 | 43.6% |
| 1 | 355 | 40.8% |
| 2 | 81 | 9.31% |
| 3 | 55 | 6.32% |
Most employees fall in stock option level 0 or 1, with far fewer in levels 2 and 3. Levels 0 and 3 have the highest attrition, while 1 and 2 are much lower; perhaps levels 1 and 2 carry more long-term incentives to keep employees at the firm. Levels 1 and 2 also have higher median and mean pay.
Salary Hike
Min. 1st Qu. Median Mean 3rd Qu. Max.
11.0 12.0 14.0 15.2 18.0 25.0
Total Working Years
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 6.00 10.00 11.05 15.00 40.00
Training Times Last Year
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 3.000 2.832 3.000 6.000
Work Life Balance
| WorkLifeBalance | Count | Proportion |
|---|---|---|
| 1 | 48 | 5.52% |
| 2 | 192 | 22.1% |
| 3 | 532 | 61.1% |
| 4 | 98 | 11.3% |
Years At Company
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 3.000 5.000 6.962 10.000 40.000
Years In CurrentRole
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 3.000 4.205 7.000 18.000
Years Since Last Promotion
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 1.000 2.169 3.000 15.000
Years With Current Manager
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 2.00 3.00 4.14 7.00 17.00
Daily, Hourly, and Monthly Rates
library(GGally)
ggpairs(data = casedata, mapping = aes(color = Attrition), columns = c("HourlyRate",
"DailyRate", "MonthlyRate", "MonthlyIncome"))
I am not seeing any meaningful relationship to attrition or Monthly Income for Daily Rate.
I am not seeing any meaningful relationship to attrition or Monthly Income for Hourly Rate, and Hourly Rate does not seem related to Daily Rate. I would normally associate hourly and daily rates with income, but in this case either that is not true or it somehow does not translate to monthly income. I will probably ignore these variables.
Other Variables
- ID
- EmployeeNumber
- EmployeeCount
- Over18
- StandardHours
These variables are not particularly useful for analysis because they are either individual IDs or do not vary: every row is one employee, every employee is over 18, and every employee has standard hours of 80. Eighteen is a standard working age, and 80 hours would seem high as a weekly figure, but we have no information about it beyond the fact that it is constant.
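Constant columns like these can also be found programmatically rather than by inspection. A minimal sketch; the `constantCols` helper and the toy data frame are my own illustration:

```r
# Return the names of columns with a single unique value
# (they carry no information for modeling)
constantCols <- function(df) {
  names(df)[sapply(df, function(col) length(unique(col)) == 1)]
}

toy <- data.frame(ID            = 1:5,
                  Over18        = rep("Y", 5),
                  StandardHours = rep(80, 5),
                  Age           = c(32, 40, 35, 24, 27))

constantCols(toy)  # "Over18" "StandardHours"
```

Applied to `casedata`, this would flag Over18, EmployeeCount, and StandardHours (the ID columns vary, so they need to be dropped by name).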
Additional Analysis
Satisfaction
I want to look at cross sections of the satisfaction variables to see what happens when people are dissatisfied in more than one area.
casedata$TotalSatisfaction <- as.factor(as.numeric(casedata$EnvironmentSatisfaction) +
as.numeric(casedata$JobSatisfaction) + as.numeric(casedata$RelationshipSatisfaction))
propPlot(casedata, x = "TotalSatisfaction", "Attrition")
casedata$TotalSatisfaction <- as.factor(round((as.numeric(casedata$EnvironmentSatisfaction) +
as.numeric(casedata$JobSatisfaction) + as.numeric(casedata$RelationshipSatisfaction))/3))
propPlot(casedata, x = "TotalSatisfaction", "Attrition")
| TotalSatisfaction | Count | Proportion |
|---|---|---|
| 1 | 24 | 2.76% |
| 2 | 296 | 34.0% |
| 3 | 455 | 52.3% |
| 4 | 95 | 10.9% |
So the most dissatisfied people leave at much higher rates, and the most satisfied rarely leave. That said, the majority fall into the average satisfaction bucket ("3"), so I am not sure how much performance it will add to the attrition model.
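One caveat worth flagging in the construction above: `as.numeric()` on a factor returns the internal level index, not the label. It works here only because the satisfaction levels happen to be "1" through "4" in sorted order. A minimal sketch of where it goes wrong, using a toy factor of my own:

```r
# Factor levels sort as strings: "10" < "2" < "4"
f <- factor(c("2", "4", "10"))

as.numeric(f)                # 2 3 1 -- level indices, not the labels
as.numeric(as.character(f))  # 2 4 10 -- the actual numeric values
```

The `as.numeric(as.character(.))` idiom is the safe way to recover numeric values from a factor.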
g <- propPlot(casedata, x = "JobRole", "EnvironmentSatisfaction") + theme(axis.text.x = element_text(angle = 90))
g$layers[[2]] <- NULL
gg <- propPlot(casedata, x = "JobRole", "JobSatisfaction") + theme(axis.text.x = element_text(angle = 90))
gg$layers[[2]] <- NULL
ggg <- propPlot(casedata, x = "JobRole", "RelationshipSatisfaction") + theme(axis.text.x = element_text(angle = 90))
ggg$layers[[2]] <- NULL
g
Performance and Salary
There are only two ratings. I wanted confirmation that 4 is the better rating, and this plot shows that employees rated 4 receive larger salary increases.
## Defining a palette so these groups are not mistaken for attrition
cbbPalette <- c("#0072B2", "#D55E00")
vioPlot(casedata, x = "PerformanceRating", y = "PercentSalaryHike") + scale_fill_manual(values = cbbPalette)
Gender Income differences by Department, Education, and Job Role
vioPlot(casedata, x = "Gender", y = "MonthlyIncome") + facet_wrap(~Department) +
scale_fill_manual(values = cbbPalette)
Men have slightly better median pay in the Sales department.
vioPlot(casedata, x = "Gender", y = "MonthlyIncome") + facet_wrap(~EducationField) +
scale_fill_manual(values = cbbPalette)
In terms of mean and median pay, men rarely have an advantage regardless of education field.
vioPlot(casedata, x = "Gender", y = "MonthlyIncome") + facet_wrap(~JobRole) +
scale_fill_manual(values = cbbPalette)
vioPlot(casedata, x = "Gender", y = "MonthlyIncome") + facet_wrap(~JobLevel) +
scale_fill_manual(values = cbbPalette)
Train/Test Split
Variable Importance
Here I try to get an idea of which variables might be important for modeling.
library(randomForest)
colsToDrop <- c("ID", "EmployeeNumber", "EmployeeCount", "Over18", "StandardHours",
"DailyRate", "HourlyRate", "MonthlyRate")
colsToDrop2 <- names(casedata)[c(37:length(casedata))]
rf.train <- train.data %>% select(-c(colsToDrop, colsToDrop2))
fit.rf.01 = train(Attrition ~ ., data = rf.train, method = "rf", importance = TRUE)
varImp(fit.rf.01$finalModel)
No Yes
Age 6.52185425 6.52185425
BusinessTravelTravel_Frequently -0.26541400 -0.26541400
BusinessTravelTravel_Rarely 0.46717934 0.46717934
DepartmentResearch & Development 1.19310689 1.19310689
DepartmentSales 2.55332552 2.55332552
DistanceFromHome 1.98471359 1.98471359
Education2 -1.06722009 -1.06722009
Education3 -1.40306383 -1.40306383
Education4 -0.01705006 -0.01705006
Education5 -1.88722037 -1.88722037
EducationFieldLife Sciences 0.30416695 0.30416695
EducationFieldMarketing 4.71069204 4.71069204
EducationFieldMedical 1.04794352 1.04794352
EducationFieldOther -2.80707947 -2.80707947
EducationFieldTechnical Degree -0.56185017 -0.56185017
EnvironmentSatisfaction2 -0.88218182 -0.88218182
EnvironmentSatisfaction3 -0.51665115 -0.51665115
EnvironmentSatisfaction4 2.27915035 2.27915035
GenderMale -0.82319377 -0.82319377
JobInvolvement2 -2.00022141 -2.00022141
JobInvolvement3 0.85106244 0.85106244
JobInvolvement4 -0.10973691 -0.10973691
JobLevel2 1.44956652 1.44956652
JobLevel3 -0.40884542 -0.40884542
JobLevel4 -0.23505241 -0.23505241
JobLevel5 -1.19875153 -1.19875153
JobRoleHuman Resources -0.29187424 -0.29187424
JobRoleLaboratory Technician 1.59762428 1.59762428
JobRoleManager -0.29298418 -0.29298418
JobRoleManufacturing Director 0.75891419 0.75891419
JobRoleResearch Director 1.65859574 1.65859574
JobRoleResearch Scientist 2.99278442 2.99278442
JobRoleSales Executive -0.85877930 -0.85877930
JobRoleSales Representative 1.90627863 1.90627863
JobSatisfaction2 -0.50095252 -0.50095252
JobSatisfaction3 -0.04278634 -0.04278634
JobSatisfaction4 2.64712819 2.64712819
[ reached 'max' / getOption("max.print") -- omitted 22 rows ]
fit.rf.02 = train(MonthlyIncome ~ ., data = rf.train, method = "rf", importance = TRUE)
varImp(fit.rf.02)
rf variable importance
only 20 most important variables shown (out of 59)
Overall
JobLevel3 100.000
JobLevel2 86.050
TotalWorkingYears 44.842
JobRoleResearch Director 35.647
JobLevel4 31.699
JobLevel5 24.098
JobRoleManager 22.089
JobRoleLaboratory Technician 22.035
JobRoleSales Executive 18.454
YearsAtCompany 15.715
YearsInCurrentRole 14.394
JobRoleResearch Scientist 13.504
JobRoleManufacturing Director 12.843
Age 12.065
JobRoleSales Representative 10.289
DepartmentSales 10.162
BusinessTravelTravel_Rarely 9.984
Education5 9.708
NumCompaniesWorked 9.325
YearsWithCurrManager 9.218
colsToDrop3 <- names(casedata)[c(38:length(casedata))]
rf.train <- train.data %>% select(-c(colsToDrop, colsToDrop3, "MonthlyIncome"))
fit.rf.03 = train(logMonthlyIncome ~ ., data = rf.train, method = "rf", importance = TRUE)
varImp(fit.rf.03)
rf variable importance
only 20 most important variables shown (out of 59)
Overall
JobLevel2 100.000
JobLevel3 78.826
TotalWorkingYears 59.557
JobRoleResearch Director 27.147
JobLevel4 25.509
JobRoleLaboratory Technician 25.354
JobRoleSales Executive 19.897
JobRoleManager 17.915
JobLevel5 17.901
YearsAtCompany 16.193
JobRoleResearch Scientist 15.634
Age 13.245
YearsInCurrentRole 13.118
JobRoleManufacturing Director 12.688
YearsWithCurrManager 12.278
YearsSinceLastPromotion 12.185
DepartmentSales 9.979
JobRoleSales Representative 9.898
JobSatisfaction2 9.486
BusinessTravelTravel_Rarely 9.188
Linear Model
colsToDrop <- c("ID", "EmployeeNumber", "EmployeeCount", "Over18", "StandardHours",
"DailyRate", "HourlyRate", "MonthlyRate")
colsToDrop2 <- names(casedata)[c(37:length(casedata))]
lm.train.01 <- train.data %>% select(-c(colsToDrop, colsToDrop2))
trnCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit.lm.01 <- train(MonthlyIncome ~ ., data = lm.train.01, method = "lm", metric = "RMSE",
trControl = trnCtrl)
broom::tidy(summary(fit.lm.01))
# A tibble: 60 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3026. 827. 3.66 0.000272
2 Age -4.13 6.05 -0.683 0.495
3 AttritionYes -54.4 124. -0.438 0.661
4 BusinessTravelTravel_Frequently 50.2 149. 0.338 0.736
5 BusinessTravelTravel_Rarely 321. 126. 2.55 0.0110
6 `DepartmentResearch & Development` 344. 574. 0.600 0.549
7 DepartmentSales -190. 581. -0.328 0.743
8 DistanceFromHome -3.88 4.78 -0.813 0.416
9 Education2 17.9 142. 0.126 0.900
10 Education3 -20.2 131. -0.154 0.878
# ... with 50 more rows
[1] 1161.022
colsToDrop3 <- names(casedata)[c(38:length(casedata))]
lm.train.02 <- train.data %>% select(-c(colsToDrop, colsToDrop3, "MonthlyIncome"))
fit.lm.02 <- train(logMonthlyIncome ~ ., data = lm.train.02, method = "lm",
metric = "RMSE", trControl = trnCtrl)
broom::tidy(summary(fit.lm.02))
# A tibble: 60 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8.01e+0 0.178 45.0 3.49e-200
2 Age 3.90e-5 0.00130 0.0299 9.76e- 1
3 AttritionYes -5.93e-2 0.0267 -2.22 2.70e- 2
4 BusinessTravelTravel_Frequently 3.06e-2 0.0320 0.955 3.40e- 1
5 BusinessTravelTravel_Rarely 7.39e-2 0.0271 2.73 6.56e- 3
6 `DepartmentResearch & Developme~ 5.13e-2 0.124 0.415 6.78e- 1
7 DepartmentSales -1.05e-2 0.125 -0.0842 9.33e- 1
8 DistanceFromHome -4.83e-4 0.00103 -0.470 6.39e- 1
9 Education2 -3.74e-3 0.0306 -0.122 9.03e- 1
10 Education3 -2.18e-2 0.0282 -0.771 4.41e- 1
# ... with 50 more rows
[1] 1206.878
fit.lm.01 <- train(MonthlyIncome ~ JobRole + JobLevel + TotalWorkingYears, data = train.data,
method = "lm", metric = "RMSE", trControl = trnCtrl)
broom::tidy(summary(fit.lm.01)) %>% mutate_if(is.numeric, round, 5)
# A tibble: 14 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3471. 197. 17.6 0
2 `JobRoleHuman Resources` -926. 298. -3.11 0.00197
3 `JobRoleLaboratory Technician` -1134. 190. -5.96 0
4 JobRoleManager 3184. 256. 12.5 0
5 `JobRoleManufacturing Director` 130. 172. 0.754 0.451
6 `JobRoleResearch Director` 3368. 224. 15.0 0
7 `JobRoleResearch Scientist` -900. 193. -4.66 0
8 `JobRoleSales Executive` -7.16 148. -0.0482 0.962
9 `JobRoleSales Representative` -1122. 236. -4.75 0
10 JobLevel2 1764. 151. 11.7 0
11 JobLevel3 5070. 207. 24.5 0
12 JobLevel4 8458. 307. 27.6 0
13 JobLevel5 11185. 362. 30.9 0
14 TotalWorkingYears 48.2 8.71 5.54 0
[1] 1105.254
fit.lm.02 <- train(logMonthlyIncome ~ JobRole + JobLevel + TotalWorkingYears,
data = train.data, method = "lm", metric = "RMSE", trControl = trnCtrl)
broom::tidy(summary(fit.lm.02)) %>% mutate_if(is.numeric, round, 5)
# A tibble: 14 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8.02 0.0424 189. 0
2 `JobRoleHuman Resources` -0.176 0.0642 -2.75 0.00617
3 `JobRoleLaboratory Technician` -0.227 0.0410 -5.53 0
4 JobRoleManager 0.260 0.0550 4.73 0
5 `JobRoleManufacturing Director` 0.0144 0.0370 0.390 0.697
6 `JobRoleResearch Director` 0.289 0.0483 5.98 0
7 `JobRoleResearch Scientist` -0.167 0.0416 -4.01 0.000070
8 `JobRoleSales Executive` -0.00218 0.0319 -0.0682 0.946
9 `JobRoleSales Representative` -0.267 0.0509 -5.25 0
10 JobLevel2 0.504 0.0325 15.5 0
11 JobLevel3 0.956 0.0445 21.5 0
12 JobLevel4 1.16 0.0660 17.5 0
13 JobLevel5 1.29 0.0779 16.6 0
14 TotalWorkingYears 0.0106 0.00187 5.64 0
[1] 1149.049
trnCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit.lm.03 <- train(MonthlyIncome ~ JobRole + JobLevel + TotalWorkingYears +
YearsAtCompany + YearsInCurrentRole + Age + Department + Education + BusinessTravel,
data = train.data, method = "lm", metric = "RMSE", trControl = trnCtrl)
broom::tidy(summary(fit.lm.03)) %>% mutate_if(is.numeric, round, 5)
# A tibble: 25 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2891. 612. 4.72 0
2 `JobRoleHuman Resources` -590. 620. -0.952 0.341
3 `JobRoleLaboratory Technician` -1126. 189. -5.95 0
4 JobRoleManager 3414. 299. 11.4 0
5 `JobRoleManufacturing Director` 78.1 171. 0.456 0.648
6 `JobRoleResearch Director` 3342. 226. 14.8 0
7 `JobRoleResearch Scientist` -890. 192. -4.63 0
8 `JobRoleSales Executive` 528. 350. 1.51 0.132
9 `JobRoleSales Representative` -547. 395. -1.38 0.167
10 JobLevel2 1795. 153. 11.7 0
# ... with 15 more rows
[1] 1122.661
fit.lm.04 <- train(MonthlyIncome ~ JobRole + JobLevel + TotalWorkingYears +
BusinessTravel, data = train.data, method = "lm", metric = "RMSE", trControl = trnCtrl)
broom::tidy(summary(fit.lm.04)) %>% mutate_if(is.numeric, round, 5)
# A tibble: 16 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3193. 221. 14.5 0
2 `JobRoleHuman Resources` -930. 295. -3.15 0.00171
3 `JobRoleLaboratory Technician` -1133. 189. -6.00 0
4 JobRoleManager 3161. 254. 12.4 0
5 `JobRoleManufacturing Director` 90.2 171. 0.528 0.598
6 `JobRoleResearch Director` 3343. 223. 15.0 0
7 `JobRoleResearch Scientist` -891. 191. -4.65 0
8 `JobRoleSales Executive` -25.8 147. -0.175 0.861
9 `JobRoleSales Representative` -1100. 235. -4.68 0
10 JobLevel2 1802. 150. 12.0 0
11 JobLevel3 5113. 206. 24.9 0
12 JobLevel4 8499. 305. 27.9 0
13 JobLevel5 11163. 359. 31.1 0
14 TotalWorkingYears 48.3 8.64 5.59 0
15 BusinessTravelTravel_Frequently 71.4 144. 0.496 0.620
16 BusinessTravelTravel_Rarely 352. 122. 2.89 0.00402
[1] 1120.712
fit.rf.04 = train(MonthlyIncome ~ JobRole + JobLevel + TotalWorkingYears + YearsAtCompany +
YearsInCurrentRole + Age + Department + Education + BusinessTravel, data = train.data,
method = "rf", importance = TRUE, ntree = 50)
varImp(fit.rf.04)
rf variable importance
only 20 most important variables shown (out of 24)
Overall
JobLevel2 100.000
JobLevel3 95.225
TotalWorkingYears 51.181
JobRoleResearch Director 39.175
JobLevel4 29.971
JobRoleSales Executive 24.835
YearsAtCompany 23.536
JobRoleManager 22.585
JobRoleLaboratory Technician 21.594
YearsInCurrentRole 20.869
BusinessTravelTravel_Rarely 17.844
JobLevel5 17.585
JobRoleManufacturing Director 15.224
JobRoleSales Representative 10.796
JobRoleResearch Scientist 10.669
Education2 10.218
Age 9.877
DepartmentSales 9.393
BusinessTravelTravel_Frequently 7.707
Education5 5.336
# broom::tidy(summary(fit.rf.04)) %>% mutate_if(is.numeric, round, 5)
pred.rf.04 <- predict(fit.rf.04, test.data)
RMSE(pred.rf.04, test.data$MonthlyIncome)
[1] 1088.056
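For reference, the RMSE values reported for these models are just the root of the mean squared prediction error. A minimal sketch on made-up observed/predicted vectors (invented for illustration, not model output):

```r
# Root mean squared error, the metric caret::RMSE reports
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

obs  <- c(4403, 19626, 9362, 10422, 3760)  # made-up "actual" incomes
pred <- c(5000, 18000, 9000, 11000, 4000)  # made-up predictions

rmse(pred, obs)
```

Because the errors are squared before averaging, a few large misses (like the $1,626 one here) dominate the score; an RMSE near $1,100 on the test set means typical prediction errors on that order.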
fit.rf.05 = train(MonthlyIncome ~ JobRole + JobLevel + TotalWorkingYears, data = train.data,
method = "rf", importance = TRUE, ntree = 50)
varImp(fit.rf.05)
rf variable importance
Overall
JobLevel2 100.0000
JobLevel3 91.8112
TotalWorkingYears 43.0409
JobRoleLaboratory Technician 29.4250
JobRoleResearch Director 27.2399
JobLevel4 22.3694
JobRoleManager 13.4035
JobRoleManufacturing Director 11.4310
JobRoleSales Executive 6.3037
JobRoleResearch Scientist 5.8021
JobLevel5 5.5704
JobRoleSales Representative 0.1003
JobRoleHuman Resources 0.0000
# broom::tidy(summary(fit.rf.04)) %>% mutate_if(is.numeric, round, 5)
pred.rf.05 <- predict(fit.rf.05, test.data)
RMSE(pred.rf.05, test.data$MonthlyIncome)
[1] 1071.911
Naive Bayes
Let's quickly try a model with everything in it.
colsToDrop <- c("ID", "EmployeeNumber", "EmployeeCount", "Over18", "StandardHours",
"DailyRate", "HourlyRate", "MonthlyRate")
colsToDrop2 <- names(casedata)[c(37:length(casedata))]
nb.train.01 <- train.data %>% select(-c(colsToDrop, colsToDrop2))
trnCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, summaryFunction = twoClassSummary,
classProbs = TRUE)
fit.nb <- train(Attrition ~ ., data = nb.train.01, method = "naive_bayes", metric = "Spec",
trControl = trnCtrl)
pred <- predict(fit.nb, test.data)
confusionMatrix(pred, test.data$Attrition, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 86 6
Yes 59 21
Accuracy : 0.6221
95% CI : (0.5451, 0.6948)
No Information Rate : 0.843
P-Value [Acc > NIR] : 1
Kappa : 0.2062
Mcnemar's Test P-Value : 1.12e-10
Sensitivity : 0.7778
Specificity : 0.5931
Pos Pred Value : 0.2625
Neg Pred Value : 0.9348
Prevalence : 0.1570
Detection Rate : 0.1221
Detection Prevalence : 0.4651
Balanced Accuracy : 0.6854
'Positive' Class : Yes
[1] 0.7257384
Not too bad. Specificity is just under the desired threshold.
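The headline statistics come straight from the confusion matrix counts. A sketch reproducing them by hand from the first Naive Bayes model's table above (positive class = "Yes"):

```r
# Counts from the confusion matrix above (positive class = "Yes")
TP <- 21  # predicted Yes, actually Yes
TN <- 86  # predicted No,  actually No
FP <- 59  # predicted Yes, actually No
FN <- 6   # predicted No,  actually Yes

sensitivity <- TP / (TP + FN)                   # 0.7778
specificity <- TN / (TN + FP)                   # 0.5931
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.6221

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```

This also makes the trade-off explicit: training with `metric = "Spec"` pushes the model toward fewer false "Yes" predictions (higher specificity) at the cost of missing some true attriters.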
fit.nb.02 <- train(Attrition ~ OverTime + MaritalStatus + StockOptionLevel,
data = nb.train.01, method = "naive_bayes", metric = "Spec", trControl = trnCtrl)
pred.02 <- predict(fit.nb.02, test.data)
confusionMatrix(pred.02, test.data$Attrition, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 126 14
Yes 19 13
Accuracy : 0.8081
95% CI : (0.7412, 0.8641)
No Information Rate : 0.843
P-Value [Acc > NIR] : 0.9107
Kappa : 0.3259
Mcnemar's Test P-Value : 0.4862
Sensitivity : 0.48148
Specificity : 0.86897
Pos Pred Value : 0.40625
Neg Pred Value : 0.90000
Prevalence : 0.15698
Detection Rate : 0.07558
Detection Prevalence : 0.18605
Balanced Accuracy : 0.67522
'Positive' Class : Yes
[1] 0.8842105
fit.nb.03 <- train(Attrition ~ OverTime + MaritalStatus + StockOptionLevel +
JobLevel, data = nb.train.01, method = "naive_bayes", metric = "Spec", trControl = trnCtrl)
pred.03 <- predict(fit.nb.03, test.data)
confusionMatrix(pred.03, test.data$Attrition, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 101 10
Yes 44 17
Accuracy : 0.686
95% CI : (0.611, 0.7545)
No Information Rate : 0.843
P-Value [Acc > NIR] : 1
Kappa : 0.2157
Mcnemar's Test P-Value : 7.098e-06
Sensitivity : 0.62963
Specificity : 0.69655
Pos Pred Value : 0.27869
Neg Pred Value : 0.90991
Prevalence : 0.15698
Detection Rate : 0.09884
Detection Prevalence : 0.35465
Balanced Accuracy : 0.66309
'Positive' Class : Yes
[1] 0.7890625
fit.nb.04 <- train(Attrition ~ OverTime + MaritalStatus + StockOptionLevel +
JobLevel + YearsInCurrentRole, data = nb.train.01, method = "naive_bayes",
metric = "Spec", trControl = trnCtrl)
pred.04 <- predict(fit.nb.04, test.data)
confusionMatrix(pred.04, test.data$Attrition, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 107 11
Yes 38 16
Accuracy : 0.7151
95% CI : (0.6414, 0.7812)
No Information Rate : 0.843
P-Value [Acc > NIR] : 0.9999930
Kappa : 0.2349
Mcnemar's Test P-Value : 0.0002038
Sensitivity : 0.59259
Specificity : 0.73793
Pos Pred Value : 0.29630
Neg Pred Value : 0.90678
Prevalence : 0.15698
Detection Rate : 0.09302
Detection Prevalence : 0.31395
Balanced Accuracy : 0.66526
'Positive' Class : Yes
[1] 0.8136882
nb.train.02 <- train.data %>% select(-c(colsToDrop))
fit.nb.05 <- train(Attrition ~ OverTime + MaritalStatus + StockOptionLevel +
JobLevel + YearsInCurrentRole + IncomeLt4000, data = nb.train.02, method = "naive_bayes",
metric = "Spec", trControl = trnCtrl)
pred.05 <- predict(fit.nb.05, test.data)
confusionMatrix(pred.05, test.data$Attrition, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 106 12
Yes 39 15
Accuracy : 0.7035
95% CI : (0.6292, 0.7706)
No Information Rate : 0.843
P-Value [Acc > NIR] : 0.9999987
Kappa : 0.2037
Mcnemar's Test P-Value : 0.0002719
Sensitivity : 0.55556
Specificity : 0.73103
Pos Pred Value : 0.27778
Neg Pred Value : 0.89831
Prevalence : 0.15698
Detection Rate : 0.08721
Detection Prevalence : 0.31395
Balanced Accuracy : 0.64330
'Positive' Class : Yes
[1] 0.8060837
fit.nb.06 <- train(Attrition ~ OverTime + MaritalStatus + StockOptionLevel +
JobLevel + YearsInCurrentRole + MonthlyIncome + JobInvolvement + WorkingYearsGroup,
data = train.data, method = "naive_bayes", metric = "Spec", trControl = trnCtrl)
pred.06 <- predict(fit.nb.06, test.data)
confusionMatrix(pred.06, test.data$Attrition, positive = "Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 100 8
Yes 45 19
Accuracy : 0.6919
95% CI : (0.6171, 0.7599)
No Information Rate : 0.843
P-Value [Acc > NIR] : 1
Kappa : 0.2525
Mcnemar's Test P-Value : 7.615e-07
Sensitivity : 0.7037
Specificity : 0.6897
Pos Pred Value : 0.2969
Neg Pred Value : 0.9259
Prevalence : 0.1570
Detection Rate : 0.1105
Detection Prevalence : 0.3721
Balanced Accuracy : 0.6967
'Positive' Class : Yes
[1] 0.7905138
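Before moving on, the candidate feature sets can be compared side by side by pulling sensitivity and specificity out of each `confusionMatrix` object rather than reading them off the printouts. A sketch, assuming the fitted `fit.nb*` models and `test.data` above are in scope:

```r
# Tabulate test-set sensitivity/specificity for each NB candidate (sketch)
nb.models <- list(nb.01 = fit.nb, nb.02 = fit.nb.02, nb.03 = fit.nb.03,
                  nb.04 = fit.nb.04, nb.05 = fit.nb.05, nb.06 = fit.nb.06)
comparison <- t(sapply(nb.models, function(m) {
  cm <- confusionMatrix(predict(m, test.data), test.data$Attrition, positive = "Yes")
  cm$byClass[c("Sensitivity", "Specificity")]
}))
round(comparison, 3)
```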
Random Forest (Attrition)
trnCtrl <- trainControl( method = "repeatedcv",
number = 10, repeats = 5,
#summaryFunction = twoClassSummary,
classProbs = TRUE)
fit.rf.06 <- train(Attrition ~ OverTime + MonthlyIncome + StockOptionLevel + TotalWorkingYears + MaritalStatus + YearsWithCurrManager +
    Age + YearsAtCompany + YearsInCurrentRole + EducationField + JobRole + JobSatisfaction + Department +
    EnvironmentSatisfaction + RelationshipSatisfaction + DistanceFromHome + NumCompaniesWorked + PerformanceRating,
    data = train.data, method = "rf", importance = TRUE, ntree = 50, trControl = trnCtrl, metric = "Sens")
varImp(fit.rf.06)
rf variable importance
only 20 most important variables shown (out of 39)
Importance
OverTimeYes 100.00
MonthlyIncome 91.02
YearsInCurrentRole 63.57
TotalWorkingYears 62.52
YearsWithCurrManager 58.07
Age 54.53
EducationFieldMarketing 45.55
MaritalStatusSingle 45.44
JobRoleResearch Scientist 41.00
JobRoleLaboratory Technician 40.44
StockOptionLevel1 40.23
YearsAtCompany 40.23
DepartmentSales 40.03
JobSatisfaction4 38.06
JobRoleResearch Director 37.36
JobRoleSales Representative 36.76
JobSatisfaction2 36.07
EnvironmentSatisfaction4 33.13
RelationshipSatisfaction3 30.64
NumCompaniesWorked 28.72
#broom::tidy(summary(fit.rf.04)) %>% mutate_if(is.numeric, round, 5)
pred.rf.06 <- predict(fit.rf.06, test.data)
confusionMatrix(pred.rf.06, test.data$Attrition)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 136 21
Yes 9 6
Accuracy : 0.8256
95% CI : (0.7605, 0.8791)
No Information Rate : 0.843
P-Value [Acc > NIR] : 0.77179
Kappa : 0.1955
Mcnemar's Test P-Value : 0.04461
Sensitivity : 0.9379
Specificity : 0.2222
Pos Pred Value : 0.8662
Neg Pred Value : 0.4000
Prevalence : 0.8430
Detection Rate : 0.7907
Detection Prevalence : 0.9128
Balanced Accuracy : 0.5801
'Positive' Class : No
[1] 0.9006623
trnCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5, summaryFunction = twoClassSummary,
classProbs = TRUE)
fit.rf.07 <- train(Attrition ~ OverTime + MonthlyIncome + StockOptionLevel +
TotalWorkingYears + MaritalStatus + Age + YearsInCurrentRole + EducationField +
JobRole + JobSatisfaction + Department + EnvironmentSatisfaction + RelationshipSatisfaction +
DistanceFromHome + NumCompaniesWorked + PerformanceRating, data = train.data,
method = "rf", importance = TRUE, ntree = 50, trControl = trnCtrl, metric = "Spec")
varImp(fit.rf.07)
rf variable importance
only 20 most important variables shown (out of 37)
Importance
OverTimeYes 100.00
MonthlyIncome 94.61
TotalWorkingYears 47.87
JobRoleResearch Scientist 46.75
MaritalStatusSingle 46.68
StockOptionLevel1 46.42
JobRoleSales Representative 38.65
EducationFieldMarketing 33.89
NumCompaniesWorked 31.00
PerformanceRating4 23.15
RelationshipSatisfaction2 21.19
JobSatisfaction4 20.74
Age 19.77
MaritalStatusMarried 19.05
EducationFieldMedical 19.00
DepartmentResearch & Development 18.85
YearsInCurrentRole 18.37
StockOptionLevel2 18.06
DepartmentSales 17.72
JobRoleManager 16.78
# broom::tidy(summary(fit.rf.04)) %>% mutate_if(is.numeric, round, 5)
pred.rf.07 <- predict(fit.rf.07, test.data)
confusionMatrix(pred.rf.07, test.data$Attrition)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 137 21
Yes 8 6
Accuracy : 0.8314
95% CI : (0.7669, 0.8841)
No Information Rate : 0.843
P-Value [Acc > NIR] : 0.70582
Kappa : 0.2078
Mcnemar's Test P-Value : 0.02586
Sensitivity : 0.9448
Specificity : 0.2222
Pos Pred Value : 0.8671
Neg Pred Value : 0.4286
Prevalence : 0.8430
Detection Rate : 0.7965
Detection Prevalence : 0.9186
Balanced Accuracy : 0.5835
'Positive' Class : No
[1] 0.9042904
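Both random forests sit near the accuracy ceiling set by the 84% majority class but catch few of the true attrition cases (specificity 0.22 with "No" as the positive class). One common remedy for this kind of imbalance is to resample inside the cross-validation loop via caret's `sampling` option; whether it helps here is an assumption, and `fit.rf.08` is a hypothetical name for the sketch below:

```r
# Down-sample the majority class within each resample (sketch;
# trainControl(sampling = "down") applies the down-sampling per fold)
trnCtrl.down <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                             summaryFunction = twoClassSummary, classProbs = TRUE,
                             sampling = "down")
fit.rf.08 <- train(Attrition ~ OverTime + MonthlyIncome + StockOptionLevel +
    TotalWorkingYears + MaritalStatus + Age + YearsInCurrentRole,
    data = train.data, method = "rf", ntree = 50,
    trControl = trnCtrl.down, metric = "Spec")
confusionMatrix(predict(fit.rf.08, test.data), test.data$Attrition, positive = "Yes")
```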